12 research outputs found

    Learning from class-imbalanced data: overlap-driven resampling for imbalanced data classification.

    Classification of imbalanced datasets has attracted substantial research interest over the past years. This is because imbalanced datasets are common in several domains such as health, finance and security, but learning algorithms are generally not designed to handle them. Many existing solutions focus mainly on the class distribution problem. However, a number of reports showed that class overlap had a higher negative impact on the learning process than class imbalance. This thesis thoroughly explores the impact of class overlap on the learning algorithm and demonstrates how elimination of class overlap can effectively improve the classification of imbalanced datasets. Novel undersampling approaches were developed with the main objective of enhancing the presence of minority class instances in the overlapping region. This is achieved by identifying and removing majority class instances potentially residing in such a region. Seven methods under the two different approaches were designed for the task. Extensive experiments were carried out to evaluate the methods on simulated and well-known real-world datasets. Results showed that substantial improvement in the classification accuracy of the minority class was obtained with favourable trade-offs with the majority class accuracy. Moreover, successful application of the methods in predictive diagnostics of diseases with imbalanced records is presented. These novel overlap-based approaches have several advantages over other common resampling methods. First, the undersampling amount is independent of class imbalance and proportional to the degree of overlap. This could effectively address the problem of class overlap while reducing the effect of class imbalance. Second, information loss is minimised as instance elimination is contained within the problematic region. Third, adaptive parameters enable the methods to be generalised across different problems. 
    It is also worth pointing out that these methods provide different trade-offs, which give real-world users more alternatives when selecting the best-fit solution to their problem.

    On the class overlap problem in imbalanced data classification.

    Class imbalance is an active research area in the machine learning community. However, existing and recent literature shows that class overlap has a greater negative impact on the performance of learning algorithms. This paper provides a detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment was carried out on the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handling imbalanced datasets. Existing solutions from selected literature are critically reviewed and categorised as class distribution-based and class overlap-based methods. Emerging techniques and the latest developments in this area are also discussed in detail. The experimental results in this paper are consistent with existing literature and show clearly that the performance of the learning algorithm deteriorates across varying degrees of class overlap, whereas class imbalance does not always have an effect. The review emphasises the need for further research on handling class overlap in imbalanced datasets to effectively improve learning algorithms’ performance.

    A data-driven decision support tool for offshore oil and gas decommissioning.

    A growing number of offshore oil and gas infrastructures across the globe are approaching the end of their operational life. Planning and deciding on decommissioning is a major challenge for the industry because the processes are resource-intensive. Whether a facility is completely removed, partially removed or left in situ, each option affects individual parties differently. Stakeholders’ concerns and needs are collected and analysed to reach the best-compromise decommissioning decision. Engaging with hundreds of stakeholders is extremely complicated, and hence time-consuming and costly. This issue can be addressed using a predictive model that suggests decommissioning options based on data from previously approved projects. However, the lack of readily available relevant datasets is the main hindrance to such an approach. In this paper, we introduce a new oil and gas decommissioning dataset extensively covering all types of offshore infrastructures in the UK landscape over a 21-year period. An experimental framework using several learning algorithms on the new dataset for predicting the decommissioning option is presented. Various resampling methods were applied to tackle the imbalanced class distribution of the dataset for improved classification. Promising results were achieved despite the exclusion of some stakeholder-related features used in the traditional approach. This points to a potential solution for the industry to significantly reduce the time and cost spent on a decommissioning project, and encourages further research on this timely topic.

    Antimicrobial Resistance and Machine Learning: Challenges and Opportunities

    Antimicrobial Resistance (AMR) has been identified by the World Health Organisation (WHO) as one of the top ten global health threats. Inappropriate use of antibiotics around the world, and in particular in Low-to-Middle-Income Countries (LMICs), where antibiotic use and prescription are poorly managed, is considered one of the main reasons for this problem. It is projected that the COVID-19 pandemic will accelerate the threat of AMR due to the increasing use of antibiotics across the world, and especially in countries with limited resources. In recent years, machine learning-based methods have shown promising results and proved capable of providing the necessary tools to inform antimicrobial prescription and combat AMR. This timely paper provides a critical and technical review of existing machine learning-based methods for addressing AMR. First, an overview of the AMR problem as a global threat to public health, and its impact on countries with limited resources (LMICs), are presented. Then, a technical review and evaluation of existing literature that utilises machine learning to tackle AMR are provided, with emphasis on methods that use readily available demographic and clinical data as well as microbial culture and sensitivity laboratory data of clinical isolates associated with multi-drug resistant infections. This is followed by a discussion of challenges and limitations that are considered barriers to scaling up the use of machine learning to address AMR. Finally, a framework for accelerating the use of data-driven approaches to AMR, and for building a feasible solution that can be realistically implemented in LMICs, is presented with a discussion of future directions and recommendations.

    Computer vision and machine learning for medical image analysis: recent advances, challenges, and way forward.

    The recent development in the areas of deep learning and deep convolutional neural networks has significantly progressed and advanced the field of computer vision (CV) and image analysis and understanding. Complex tasks such as classifying and segmenting medical images and localising and recognising objects of interest have become much less challenging. This progress has the potential of accelerating research and deployment of multitudes of medical applications that utilise CV. However, in reality, there are limited practical examples being physically deployed into front-line health facilities. In this paper, we examine the current state of the art in CV as applied to the medical domain. We discuss the main challenges in CV and intelligent data-driven medical applications and suggest future directions to accelerate research, development, and deployment of CV applications in health practices. First, we critically review existing literature in the CV domain that addresses complex vision tasks, including: medical image classification; shape and object recognition from images; and medical segmentation. Second, we present an in-depth discussion of the various challenges that are considered barriers to accelerating research, development, and deployment of intelligent CV methods in real-life medical applications and hospitals. Finally, we conclude by discussing future directions

    Symbols in engineering drawings (SiED): an imbalanced dataset benchmarked by convolutional neural networks.

    Engineering drawings are common across different domains such as Oil & Gas, construction and mechanical engineering. Automatic processing and analysis of these drawings is a challenging task. This is partly due to the complexity of these documents and partly due to the lack of publicly available datasets that could help advance research in this area. In this paper, we present a multiclass imbalanced dataset for the research community made of 2432 instances of engineering symbols. These symbols were extracted from a collection of complex engineering drawings known as Piping and Instrumentation Diagrams (P&IDs). By providing such a dataset to the research community, we anticipate that this will help attract more attention to an important, yet overlooked, industrial problem, and will also advance research on this important and timely topic. We discuss the dataset's characteristics in detail, and we also show how Convolutional Neural Networks (CNNs) perform on such extremely imbalanced datasets. Finally, conclusions and future directions are discussed.

    Response to discussion on “Improved overlap-based undersampling for imbalanced dataset classification with application to epilepsy and Parkinson’s disease.”

    In the paper 'Improved Overlap-Based Undersampling for Imbalanced Dataset Classification with Application to Epilepsy and Parkinson's Disease', the authors introduced two new methods that address the class overlap problem in imbalanced datasets. The methods involve identification and removal of potentially overlapped majority class instances. Extensive evaluations were carried out using 136 datasets and compared against several state-of-the-art methods. Results showed competitive performance with those methods, and statistical tests proved significant improvement in classification results. A discussion on the paper, relating to the behavioral analysis of class overlap and method validation, was raised by Fernández. In this article, the response to that discussion is delivered. Detailed clarification and supporting evidence to answer all the points raised are provided.

    Neighbourhood-based undersampling approach for handling imbalanced and overlapped data.

    Class-imbalanced datasets are common across different domains including health, security, banking and others. A typical supervised learning algorithm tends to be biased towards the majority class when dealing with imbalanced datasets. The learning task becomes more challenging when there is also an overlap of instances from different classes. In this paper, we propose an undersampling framework for handling class imbalance in binary datasets by removing potentially overlapped data points. Our methods are designed to identify and eliminate majority class instances from the overlapping region. Accurate identification and elimination of these instances maximises the visibility of the minority class instances and at the same time minimises excessive elimination of data, which reduces information loss. Four methods based on neighbourhood searching with different criteria to identify potentially overlapped instances are proposed in this paper. Extensive experiments using simulated and real-world datasets were carried out. Results show comparable performance with state-of-the-art methods across different common metrics, with exceptional and statistically significant improvements in sensitivity.
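    The general idea of neighbourhood-based undersampling described in this abstract can be illustrated with a minimal sketch. It is not one of the four methods proposed in the paper, whose criteria are not given here. The function name `overlap_undersample`, the parameter `k`, and the removal rule (drop a majority instance if any of its `k` nearest neighbours belongs to the minority class) are assumptions made purely for illustration.

```python
import math


def overlap_undersample(X, y, majority=0, minority=1, k=3):
    """Illustrative only: drop majority instances that have at least one
    minority instance among their k nearest neighbours (Euclidean distance).
    The removal criterion is an assumption, not the paper's method."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority:
            keep.append(i)  # minority instances are never removed
            continue
        # Distances from instance i to every other instance, nearest first.
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = [j for _, j in dists[:k]]
        if any(y[j] == minority for j in neighbours):
            continue  # treated as lying in the overlapping region
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: a majority cluster near the origin, plus one majority point
# sitting next to the two minority points.
X_demo = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (5.0, 5.0), (5.1, 5.0), (5.2, 5.1)]
y_demo = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = overlap_undersample(X_demo, y_demo, k=2)
```

    In this toy example only the majority point at (5.0, 5.0) is removed, so the amount of undersampling depends on how many majority instances sit close to the minority class rather than on the imbalance ratio.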
